About The Dataset

The project aims to predict the number of wins (W) for an MLB team in 2015 from 16 features drawn from 2014 data. The offensive features are R (runs scored), AB (at bats), H (hits), 2B (doubles), 3B (triples), HR (home runs), BB (walks), SO (strikeouts), and SB (stolen bases). The pitching features are RA (runs allowed), ER (earned runs allowed), ERA (earned run average), CG (complete games), SHO (shutouts), and SV (saves); these indicate how well a team's pitchers prevent the opposing team from scoring. The last feature is E, the number of errors committed by the fielders, which give the opposing offense an advantage. The output is the predicted number of wins (W) for a team in 2015. For more details on baseball statistics, you can visit this link: https://en.wikipedia.org/wiki/Baseball_statistics

Importing The Libraries
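
The code cells are not shown in this export; a minimal sketch of the setup, rebuilding the first three rows of the table inline as a stand-in for `pd.read_csv("baseball.csv")` (the filename is an assumption):

```python
import pandas as pd

# In the notebook the full table is presumably loaded with
# pd.read_csv("baseball.csv"); here we rebuild the first three
# rows shown below as a self-contained stand-in.
df = pd.DataFrame({
    "W": [95, 83, 81], "R": [724, 696, 669], "AB": [5575, 5467, 5439],
    "H": [1497, 1349, 1395], "2B": [300, 277, 303], "3B": [42, 44, 29],
    "HR": [139, 156, 141], "BB": [383, 439, 533], "SO": [973, 1264, 1157],
    "SB": [104, 70, 86], "RA": [641, 700, 640], "ER": [601, 653, 584],
    "ERA": [3.73, 4.07, 3.67], "CG": [2, 2, 11], "SHO": [8, 12, 10],
    "SV": [56, 45, 38], "E": [88, 86, 79],
})
print(df.shape)            # (3, 17) for this excerpt; (30, 17) for the full file
print(df.head())
```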

In [19]:
In [20]:
In [21]:
Out[21]:
W R AB H 2B 3B HR BB SO SB RA ER ERA CG SHO SV E
0 95 724 5575 1497 300 42 139 383 973 104 641 601 3.73 2 8 56 88
1 83 696 5467 1349 277 44 156 439 1264 70 700 653 4.07 2 12 45 86
2 81 669 5439 1395 303 29 141 533 1157 86 640 584 3.67 11 10 38 79
3 76 622 5533 1381 260 27 136 404 1231 68 701 643 3.98 7 9 37 101
4 74 689 5605 1515 289 49 151 455 1259 83 803 746 4.64 7 12 35 86

The preview shows the first 5 rows of the 17 columns: the 16 input features covering offense, pitching, and fielding, plus the output column W (the number of wins).

In [22]:
Out[22]:
W R AB H 2B 3B HR BB SO SB RA ER ERA CG SHO SV E
0 95 724 5575 1497 300 42 139 383 973 104 641 601 3.73 2 8 56 88
1 83 696 5467 1349 277 44 156 439 1264 70 700 653 4.07 2 12 45 86
2 81 669 5439 1395 303 29 141 533 1157 86 640 584 3.67 11 10 38 79
3 76 622 5533 1381 260 27 136 404 1231 68 701 643 3.98 7 9 37 101
4 74 689 5605 1515 289 49 151 455 1259 83 803 746 4.64 7 12 35 86
5 93 891 5509 1480 308 17 232 570 1151 88 670 609 3.80 7 10 34 88
6 87 764 5567 1397 272 19 212 554 1227 63 698 652 4.03 3 4 48 93
7 81 713 5485 1370 246 20 217 418 1331 44 693 646 4.05 0 10 43 77
8 80 644 5485 1383 278 32 167 436 1310 87 642 604 3.74 1 12 60 95
9 78 748 5640 1495 294 33 161 478 1148 71 753 694 4.31 3 10 40 97
10 88 751 5511 1419 279 32 172 503 1233 101 733 680 4.24 5 9 45 119
11 86 729 5459 1363 278 26 230 486 1392 121 618 572 3.57 5 13 39 85
12 85 661 5417 1331 243 21 176 435 1150 52 675 630 3.94 2 12 46 93
13 76 656 5544 1379 262 22 198 478 1336 69 726 677 4.16 6 12 45 94
14 68 694 5600 1405 277 46 146 475 1119 78 729 664 4.14 5 15 28 126
15 100 647 5484 1386 288 39 137 506 1267 69 525 478 2.94 1 15 62 96
16 98 697 5631 1462 292 27 140 461 1322 98 596 532 3.21 0 13 54 122
17 97 689 5491 1341 272 30 171 567 1518 95 608 546 3.36 6 21 48 111
18 68 655 5480 1378 274 34 145 412 1299 84 737 682 4.28 1 7 40 116
19 64 640 5571 1382 257 27 167 496 1255 134 754 700 4.33 2 8 35 90
20 90 683 5527 1351 295 17 177 488 1290 51 613 557 3.43 1 14 50 88
21 83 703 5428 1363 265 13 177 539 1344 57 635 577 3.62 4 13 41 90
22 71 613 5463 1420 236 40 120 375 1150 112 678 638 4.02 0 12 35 77
23 67 573 5420 1361 251 18 100 471 1107 69 760 698 4.41 3 10 44 90
24 63 626 5529 1374 272 37 130 387 1274 88 809 749 4.69 1 7 35 117
25 92 667 5385 1346 263 26 187 563 1258 59 595 553 3.44 6 21 47 75
26 84 696 5565 1486 288 39 136 457 1159 93 627 597 3.72 7 18 41 78
27 79 720 5649 1494 289 48 154 490 1312 132 713 659 4.04 1 12 44 86
28 74 650 5457 1324 260 36 148 426 1327 82 731 655 4.09 1 6 41 92
29 68 737 5572 1479 274 49 186 388 1283 97 844 799 5.04 4 4 36 95
In [23]:
Out[23]:
(30, 17)

The data has 30 rows and 17 columns.

In [24]:
Out[24]:
Index(['W', 'R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'RA', 'ER',
       'ERA', 'CG', 'SHO', 'SV', 'E'],
      dtype='object')

The data shows the names of all the features.

In [25]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   W       30 non-null     int64  
 1   R       30 non-null     int64  
 2   AB      30 non-null     int64  
 3   H       30 non-null     int64  
 4   2B      30 non-null     int64  
 5   3B      30 non-null     int64  
 6   HR      30 non-null     int64  
 7   BB      30 non-null     int64  
 8   SO      30 non-null     int64  
 9   SB      30 non-null     int64  
 10  RA      30 non-null     int64  
 11  ER      30 non-null     int64  
 12  ERA     30 non-null     float64
 13  CG      30 non-null     int64  
 14  SHO     30 non-null     int64  
 15  SV      30 non-null     int64  
 16  E       30 non-null     int64  
dtypes: float64(1), int64(16)
memory usage: 4.1 KB

The data has 30 records, 17 features, and no missing values. The features have different data types: one is float and 16 are integer.

In [26]:
Out[26]:
W      0
R      0
AB     0
H      0
2B     0
3B     0
HR     0
BB     0
SO     0
SB     0
RA     0
ER     0
ERA    0
CG     0
SHO    0
SV     0
E      0
dtype: int64

The data is complete and has no missing values.
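
The check itself is a one-liner; a sketch on a small stand-in frame:

```python
import pandas as pd

# Small stand-in frame (the notebook runs this on the full 30-row table).
df = pd.DataFrame({"W": [95, 83, 81], "ERA": [3.73, 4.07, 3.67]})

missing = df.isnull().sum()   # count of NaN values per column
print(missing)
```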

In [27]:
Out[27]:
W R AB H 2B 3B HR BB SO SB RA ER ERA CG SHO SV E
count 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.00000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000 30.000000
mean 80.966667 688.233333 5516.266667 1403.533333 274.733333 31.300000 163.633333 469.100000 1248.20000 83.500000 688.233333 635.833333 3.956333 3.466667 11.300000 43.066667 94.333333
std 10.453455 58.761754 70.467372 57.140923 18.095405 10.452355 31.823309 57.053725 103.75947 22.815225 72.108005 70.140786 0.454089 2.763473 4.120177 7.869335 13.958889
min 63.000000 573.000000 5385.000000 1324.000000 236.000000 13.000000 100.000000 375.000000 973.00000 44.000000 525.000000 478.000000 2.940000 0.000000 4.000000 28.000000 75.000000
25% 74.000000 651.250000 5464.000000 1363.000000 262.250000 23.000000 140.250000 428.250000 1157.50000 69.000000 636.250000 587.250000 3.682500 1.000000 9.000000 37.250000 86.000000
50% 81.000000 689.000000 5510.000000 1382.500000 275.500000 31.000000 158.500000 473.000000 1261.50000 83.500000 695.500000 644.500000 4.025000 3.000000 12.000000 42.000000 91.000000
75% 87.750000 718.250000 5570.000000 1451.500000 288.750000 39.000000 177.000000 501.250000 1311.50000 96.500000 732.500000 679.250000 4.220000 5.750000 13.000000 46.750000 96.750000
max 100.000000 891.000000 5649.000000 1515.000000 308.000000 49.000000 232.000000 570.000000 1518.00000 134.000000 844.000000 799.000000 5.040000 11.000000 21.000000 62.000000 126.000000

The data is complete and entirely numerical, but the features sit on very different scales (AB is in the thousands while ERA stays below about 5), and several features are not normally distributed.

In [28]:
Out[28]:
count mean std min 25% 50% 75% max
W 30.0 80.966667 10.453455 63.00 74.0000 81.000 87.75 100.00
R 30.0 688.233333 58.761754 573.00 651.2500 689.000 718.25 891.00
AB 30.0 5516.266667 70.467372 5385.00 5464.0000 5510.000 5570.00 5649.00
H 30.0 1403.533333 57.140923 1324.00 1363.0000 1382.500 1451.50 1515.00
2B 30.0 274.733333 18.095405 236.00 262.2500 275.500 288.75 308.00
3B 30.0 31.300000 10.452355 13.00 23.0000 31.000 39.00 49.00
HR 30.0 163.633333 31.823309 100.00 140.2500 158.500 177.00 232.00
BB 30.0 469.100000 57.053725 375.00 428.2500 473.000 501.25 570.00
SO 30.0 1248.200000 103.759470 973.00 1157.5000 1261.500 1311.50 1518.00
SB 30.0 83.500000 22.815225 44.00 69.0000 83.500 96.50 134.00
RA 30.0 688.233333 72.108005 525.00 636.2500 695.500 732.50 844.00
ER 30.0 635.833333 70.140786 478.00 587.2500 644.500 679.25 799.00
ERA 30.0 3.956333 0.454089 2.94 3.6825 4.025 4.22 5.04
CG 30.0 3.466667 2.763473 0.00 1.0000 3.000 5.75 11.00
SHO 30.0 11.300000 4.120177 4.00 9.0000 12.000 13.00 21.00
SV 30.0 43.066667 7.869335 28.00 37.2500 42.000 46.75 62.00
E 30.0 94.333333 13.958889 75.00 86.0000 91.000 96.75 126.00
In [29]:
Out[29]:
array([ 95,  83,  81,  76,  74,  93,  87,  80,  78,  88,  86,  85,  68,
       100,  98,  97,  64,  90,  71,  67,  63,  92,  84,  79], dtype=int64)
In [30]:
Out[30]:
<AxesSubplot:xlabel='W', ylabel='Density'>
In [31]:
Out[31]:
<AxesSubplot:xlabel='W', ylabel='Count'>
In [32]:

EXPLORATORY DATA ANALYSIS (EDA)

In [33]:
In [11]:
Out[11]:
<seaborn.axisgrid.PairGrid at 0x1627124abe0>

The chart below shows how W (Wins) is correlated with other features.

In [34]:
Out[34]:
<AxesSubplot:>

Observation:-

The heatmap shows that RA, ER, and ERA are highly correlated with one another, and each has a strong negative correlation with W. We will remove RA and ER and keep only ERA for the prediction model.
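
The drop cell is hidden; a sketch using pandas `drop` on a stand-in frame:

```python
import pandas as pd

# Stand-in frame containing the redundant pitching columns.
df = pd.DataFrame({
    "W": [95, 83], "RA": [641, 700], "ER": [601, 653], "ERA": [3.73, 4.07],
})

# RA and ER carry nearly the same information as ERA, so drop them.
df = df.drop(columns=["RA", "ER"])
print(df.columns.tolist())
```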

In [35]:
In [36]:
Out[36]:
(30, 15)

We are left with 15 columns (14 input features plus W) after removing RA and ER.

In [37]:
Out[37]:
W      0
R      0
AB     0
H      0
2B     0
3B     0
HR     0
BB     0
SO     0
SB     0
ERA    0
CG     0
SHO    0
SV     0
E      0
dtype: int64
In [38]:
Out[38]:
<AxesSubplot:>
In [39]:
Out[39]:
<AxesSubplot:>

The data has no missing values.

Finding Correlations

In [40]:
Out[40]:
W R AB H 2B 3B HR BB SO SB ERA CG SHO SV E
W 1.000000 0.430751 -0.087947 0.037612 0.427797 -0.251118 0.307407 0.484342 0.111850 -0.157234 -0.819600 0.080533 0.471805 0.666530 -0.089485
R 0.430751 1.000000 0.319464 0.482856 0.560084 -0.070072 0.671283 0.402452 -0.054726 0.081367 -0.049281 0.232042 -0.103274 -0.096380 -0.023262
AB -0.087947 0.319464 1.000000 0.739122 0.453370 0.435422 -0.066983 -0.136414 -0.106022 0.372618 0.255551 -0.080876 -0.197321 -0.106367 0.316743
H 0.037612 0.482856 0.739122 1.000000 0.566847 0.478694 -0.090855 -0.118281 -0.398830 0.413444 0.231172 0.147955 -0.145559 -0.130371 -0.033173
2B 0.427797 0.560084 0.453370 0.566847 1.000000 0.220490 0.056292 0.302700 -0.150752 0.195027 -0.254854 0.306675 0.057998 0.171576 0.105754
3B -0.251118 -0.070072 0.435422 0.478694 0.220490 1.000000 -0.430915 -0.454949 -0.141196 0.457437 0.330951 -0.065898 -0.041396 -0.142370 0.126678
HR 0.307407 0.671283 -0.066983 -0.090855 0.056292 -0.430915 1.000000 0.425691 0.359923 -0.136567 -0.090917 0.156502 -0.019119 -0.028540 -0.207597
BB 0.484342 0.402452 -0.136414 -0.118281 0.302700 -0.454949 0.425691 1.000000 0.233652 -0.098347 -0.459832 0.462478 0.426004 0.099445 -0.075685
SO 0.111850 -0.054726 -0.106022 -0.398830 -0.150752 -0.141196 0.359923 0.233652 1.000000 0.030968 -0.180368 -0.093418 0.237721 0.126297 0.155133
SB -0.157234 0.081367 0.372618 0.413444 0.195027 0.457437 -0.136567 -0.098347 0.030968 1.000000 0.126063 -0.020783 -0.106563 -0.183418 0.079149
ERA -0.819600 -0.049281 0.255551 0.231172 -0.254854 0.330951 -0.090917 -0.459832 -0.180368 0.126063 1.000000 -0.009856 -0.630833 -0.607005 0.113137
CG 0.080533 0.232042 -0.080876 0.147955 0.306675 -0.065898 0.156502 0.462478 -0.093418 -0.020783 -0.009856 1.000000 0.241676 -0.367766 -0.140047
SHO 0.471805 -0.103274 -0.197321 -0.145559 0.057998 -0.041396 -0.019119 0.426004 0.237721 -0.106563 -0.630833 0.241676 1.000000 0.221639 -0.115716
SV 0.666530 -0.096380 -0.106367 -0.130371 0.171576 -0.142370 -0.028540 0.099445 0.126297 -0.183418 -0.607005 -0.367766 0.221639 1.000000 -0.025636
E -0.089485 -0.023262 0.316743 -0.033173 0.105754 0.126678 -0.207597 -0.075685 0.155133 0.079149 0.113137 -0.140047 -0.115716 -0.025636 1.000000
In [41]:
Out[41]:
<AxesSubplot:>

ERA is the only remaining feature with a strong negative correlation with W (about -0.82). Most features correlate only weakly with wins, while SV, SHO, BB, 2B, and R are moderately positively correlated with it. ERA is itself strongly negatively correlated with SV and SHO.
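
The correlation cell is hidden; a sketch of computing each feature's correlation with W, using the first five teams as a stand-in:

```python
import pandas as pd

# First five teams from the table above.
df = pd.DataFrame({
    "W":   [95, 83, 81, 76, 74],
    "ERA": [3.73, 4.07, 3.67, 3.98, 4.64],
    "SV":  [56, 45, 38, 37, 35],
})

# Pearson correlation of every feature with wins, weakest to strongest.
corr_with_w = df.corr()["W"].drop("W").sort_values()
print(corr_with_w)
```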

In [42]:

The plot shows that more wins go with a lower earned run average: teams with roughly 65 to 88 wins allow about 4 to 5 earned runs per game, while teams with 90 to 100 wins allow about 3 to 4.
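
The plotting cell is hidden (the notebook likely uses seaborn); a plain matplotlib sketch of the same ERA-versus-wins scatter, using points from the table above:

```python
import matplotlib
matplotlib.use("Agg")               # render off-screen
import matplotlib.pyplot as plt

# Illustrative (ERA, W) pairs taken from the table above.
era  = [3.73, 4.07, 3.67, 3.98, 4.64, 2.94, 3.21]
wins = [95,   83,   81,   76,   74,   100,  98]

fig, ax = plt.subplots()
ax.scatter(era, wins)
ax.set_xlabel("ERA")
ax.set_ylabel("W")
fig.savefig("era_vs_wins.png")
```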

In [43]:

Saves also have a clear impact: teams with more saves tend to win more games.

In [44]:

Shutouts have a slight effect on the data. In some cases, higher shutouts are associated with more wins.

In [45]:

We can also observe that more walks are related to more wins.

In [46]:

Doubles also contribute to wins: teams that hit more doubles tend to win more games.

In [47]:

Most teams score between 600 and 800 runs over the season, so R does not vary much, though there are a few outliers.

Dealing with outliers

In [48]:

There are outliers in the data for only 5 columns: R, ERA, SHO, SV, and E.

Z-score approach

In [49]:
In [50]:
(array([5], dtype=int64), array([1], dtype=int64))
In [52]:
With Outliers:: (30, 15)
After Removing Outliers:: (29, 15)

Only one row was eliminated by using the Z-score method.
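
The removal cell is hidden; a sketch of the Z-score filter on a synthetic stand-in column (one extreme value among twenty):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Synthetic stand-in: 19 typical run totals plus one extreme value.
runs = [700, 690, 710, 695, 705, 688, 712, 699, 701, 693,
        707, 696, 704, 691, 709, 698, 702, 694, 706, 5000]
df = pd.DataFrame({"R": runs})

z = np.abs(df.apply(zscore))       # |z-score| of every cell
clean = df[(z < 3).all(axis=1)]    # keep rows with no |z| >= 3
print("With Outliers::", df.shape)
print("After Removing Outliers::", clean.shape)
```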

In [53]:
In [55]:
Z-score of element (5, 1): 3.5096470447193067

In this code, we first import the necessary libraries. Then, we load the "baseball.csv" dataset into a pandas DataFrame called df. We then use the apply function along with zscore to calculate the z-scores for each column in the DataFrame. Next, we convert the resulting DataFrame df_zscore into a NumPy array z. Finally, we access the z-score of a specific element in the array using the NumPy indexing syntax z[5, 1] and print it to the console.

In [54]:
           W         R        AB         H        2B        3B        HR  \
0   1.365409  0.619078  0.847731  1.663685  1.420173  1.041193 -0.787299   
1   0.197838  0.134432 -0.711094 -0.970681  0.127403  1.235809 -0.243967   
2   0.003243 -0.332906 -1.115233 -0.151891  1.588795 -0.223808 -0.723377   
3  -0.483244 -1.146419  0.241522 -0.401088 -0.828122 -0.418423 -0.883181   
4  -0.677839  0.013270  1.280738  1.984081  0.801892  1.722347 -0.403770   
5   1.170814  3.509647 -0.104884  1.361089  1.869832 -1.391501  2.185047   
6   0.587028  1.311430  0.732263 -0.116292 -0.153633 -1.196885  1.545833   
7   0.003243  0.428681 -0.451289 -0.596886 -1.615025 -1.099578  1.705636   
8  -0.094054 -0.765626 -0.451289 -0.365489  0.183611  0.068115  0.107601   
9  -0.288649  1.034489  1.785913  1.628086  1.082929  0.165423 -0.084163   
10  0.684326  1.086415 -0.076017  0.275303  0.239818  0.068115  0.267405   
11  0.489731  0.705622 -0.826562 -0.721484  0.183611 -0.515731  2.121125   
12  0.392433 -0.471376 -1.432772 -1.291077 -1.783647 -1.002270  0.395247   
13 -0.483244 -0.557920  0.400291 -0.436688 -0.715707 -0.904962  1.098383   
14 -1.261625  0.099814  1.208570  0.026106  0.127403  1.430424 -0.563574   
15  1.851896 -0.713699 -0.465723 -0.312089  0.745685  0.749270 -0.851220   
16  1.657301  0.151740  1.656011  1.040693  0.970514 -0.418423 -0.755338   
17  1.560004  0.013270 -0.364688 -1.113079 -0.153633 -0.126500  0.235444   
18 -1.261625 -0.575229 -0.523457 -0.454487 -0.041219  0.262731 -0.595534   
19 -1.650815 -0.834861  0.789997 -0.383288 -0.996744 -0.418423  0.107601   
20  0.878921 -0.090583  0.154920 -0.935081  1.139136 -1.391501  0.427208   
21  0.197838  0.255593 -1.274002 -0.721484 -0.547085 -1.780732  0.427208   
22 -0.969732 -1.302198 -0.768828  0.293103 -2.177099  0.846578 -1.394552   
23 -1.358922 -1.994550 -1.389471 -0.757084 -1.333988 -1.294193 -2.033766   
24 -1.748112 -1.077184  0.183787 -0.525687 -0.153633  0.554654 -1.074945   
25  1.073516 -0.367523 -1.894646 -1.024080 -0.659500 -0.515731  0.746815   
26  0.295136  0.134432  0.703396  1.467888  0.745685  0.749270 -0.883181   
27 -0.191352  0.549843  1.915815  1.610286  0.801892  1.625040 -0.307888   
28 -0.677839 -0.661773 -0.855429 -1.415675 -0.828122  0.457346 -0.499652   
29 -1.261625  0.844092  0.804431  1.343289 -0.041219  1.722347  0.714854   

          BB        SO        SB       ERA        CG       SHO        SV  \
0  -1.534902 -2.697630  0.913883 -0.506955 -0.539806 -0.814629  1.671607   
1  -0.536592  0.154878 -0.601826  0.254598 -0.539806  0.172800  0.249879   
2   1.139144 -0.893982  0.111449 -0.641347  2.772641 -0.320914 -0.654856   
3  -1.160536 -0.168602 -0.690985  0.053010  1.300442 -0.567771 -0.784104   
4  -0.251360  0.105866 -0.022290  1.531318  1.300442  0.172800 -1.042600   
5   1.798742 -0.952796  0.200609 -0.350165  1.300442 -0.320914 -1.171848   
6   1.513510 -0.207812 -0.913883  0.165003 -0.171757 -1.802057  0.637623   
7  -0.910958  0.811641 -1.760897  0.209800 -1.275906 -0.320914 -0.008617   
8  -0.590073  0.605790  0.156029 -0.484557 -0.907856  0.172800  2.188598   
9   0.158660 -0.982204 -0.557246  0.792164 -0.171757 -0.320914 -0.396360   
10  0.604334 -0.148997  0.780144  0.635374  0.564343 -0.567771  0.249879   
11  0.301276  1.409590  1.671738 -0.865333  0.564343  0.419657 -0.525608   
12 -0.607900 -0.962599 -1.404260 -0.036584 -0.539806  0.172800  0.379127   
13  0.158660  0.860654 -0.646405  0.456185  0.932393  0.172800  0.249879   
14  0.105179 -1.266474 -0.245188  0.411388  0.564343  0.913371 -1.947335   
15  0.657815  0.184286 -0.646405 -2.276445 -0.907856  0.913371  2.447094   
16 -0.144398  0.723420  0.646405 -1.671683 -1.275906  0.419657  1.413111   
17  1.745261  2.644696  0.512666 -1.335704  0.932393  2.394514  0.637623   
18 -1.017920  0.497964  0.022290  0.724968 -0.907856 -1.061486 -0.396360   
19  0.479546  0.066657  2.251273  0.836961 -0.539806 -0.814629 -1.042600   
20  0.336930  0.409742 -1.448839 -1.178913 -0.907856  0.666514  0.896119   
21  1.246105  0.939073 -1.181361 -0.753340  0.196293  0.419657 -0.267112   
22 -1.677518 -0.962599  1.270521  0.142604 -1.275906  0.172800 -1.042600   
23  0.033871 -1.384104 -0.646405  1.016150 -0.171757 -0.320914  0.120631   
24 -1.463595  0.252903  0.200609  1.643311 -0.907856 -1.061486 -1.042600   
25  1.673953  0.096064 -1.092202 -1.156515  0.932393  2.394514  0.508375   
26 -0.215706 -0.874377  0.423507 -0.529354  1.300442  1.653943 -0.267112   
27  0.372584  0.625395  2.162114  0.187402 -0.907856  0.172800  0.120631   
28 -0.768343  0.772432 -0.066870  0.299395 -0.907856 -1.308343 -0.267112   
29 -1.445768  0.341125  0.601826  2.427263  0.196293 -1.802057 -0.913352   

           E  
0  -0.461470  
1  -0.607197  
2  -1.117242  
3   0.485758  
4  -0.607197  
5  -0.461470  
6  -0.097152  
7  -1.262970  
8   0.048576  
9   0.194303  
10  1.797303  
11 -0.680061  
12 -0.097152  
13 -0.024288  
14  2.307348  
15  0.121439  
16  2.015894  
17  1.214394  
18  1.578712  
19 -0.315742  
20 -0.461470  
21 -0.315742  
22 -1.262970  
23 -0.315742  
24  1.651576  
25 -1.408697  
26 -1.190106  
27 -0.607197  
28 -0.170015  
29  0.048576  

Interquartile range method (IQR)

In [53]:
Out[53]:
614.25
In [54]:
In [55]:
(30, 15)

The IQR method does not flag any outliers, which is plausible given how small the dataset is. Since we have very little data, we stick with the Z-score result, which removes only a single row.
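
A sketch of the IQR fence check on a single stand-in column (hypothetical win totals):

```python
import pandas as pd

# Stand-in column; the notebook applies the same fences to every column.
s = pd.Series([63, 68, 74, 76, 81, 83, 88, 93, 95, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences

outliers = s[(s < lower) | (s > upper)]
print(len(outliers))   # nothing falls outside the fences here
```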

In [56]:
In [57]:
Out[57]:
(29, 15)

Separating data into X and Y components

In [58]:

Reducing skewness in data

In [59]:
In [60]:
Out[60]:
R     -0.215364
AB     0.169573
H      0.783772
2B    -0.335304
3B     0.090124
HR     0.450862
BB     0.151193
SO    -0.233815
SB     0.494966
ERA    0.016693
CG     0.854980
SHO    0.526943
SV     0.627480
E      0.840271
dtype: float64

Some columns (H, CG, SHO, SV, E) are noticeably skewed. We will correct the skewness of any column whose skew falls outside +/-0.5, making these columns more symmetric.
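
The transformation cell is hidden and the exact method isn't stated; this sketch assumes scikit-learn's PowerTransformer (Yeo-Johnson), which is consistent with the near-zero skew values in the output below:

```python
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Right-skewed stand-in column (the notebook treats H, CG, SHO, SV, E).
df = pd.DataFrame({"E": [75.0, 77, 78, 79, 85, 86, 86, 88, 90, 126]})

before = df["E"].skew()
pt = PowerTransformer()                           # Yeo-Johnson by default
df["E"] = pt.fit_transform(df[["E"]]).ravel()
after = df["E"].skew()
print(round(before, 3), "->", round(after, 3))
```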

In [61]:
In [62]:
Out[62]:
R     -0.215364
AB     0.169573
H      0.000000
2B    -0.335304
3B     0.090124
HR     0.450862
BB     0.151193
SO    -0.233815
SB     0.494966
ERA    0.016693
CG    -0.045947
SHO    0.000529
SV    -0.000925
E      0.065585
dtype: float64

The skewness has been almost eliminated from every column.

Applying standardization to X values

In [63]:

Note that column 2 (originally H) became constant during the skewness transformation, so scaling maps every one of its values to 0.

In [64]:
Out[64]:
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
dtype: int64
In [65]:
Out[65]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13
count 29.000000 29.000000 29.0 29.000000 29.000000 29.000000 29.000000 29.000000 29.000000 29.000000 29.000000 29.000000 29.000000 29.000000
mean 0.566709 0.498171 0.0 0.560988 0.522031 0.471353 0.471983 0.511104 0.437165 0.486535 0.462820 0.496773 0.545151 0.508101
std 0.237471 0.271595 0.0 0.257760 0.285448 0.227698 0.285046 0.190697 0.257811 0.219594 0.271806 0.244940 0.222023 0.276493
min 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.403141 0.295455 0.0 0.388060 0.361111 0.307692 0.265625 0.341284 0.277778 0.347619 0.230279 0.373171 0.400262 0.361169
50% 0.607330 0.477273 0.0 0.567164 0.527778 0.430769 0.500000 0.533945 0.433333 0.519048 0.495369 0.553751 0.556566 0.508619
75% 0.732984 0.704545 0.0 0.776119 0.722222 0.592308 0.630208 0.622018 0.588889 0.619048 0.668644 0.609418 0.666996 0.612261
max 1.000000 1.000000 0.0 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

We can see that the data has been scaled to the [0, 1] range.

Creating training and testing subsets from the data

In [66]:

In the following cell we search for the train/test-split random state that gives the best R² score (printed below as "accuracy").
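
The search loop itself is hidden; a sketch of the idea on synthetic stand-in data, assuming an sklearn LinearRegression scored with R² (which matches the values printed below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 29 scaled rows and 14 features.
rng = np.random.default_rng(0)
X = rng.random((29, 14))
y = X @ rng.random(14) * 40 + 60 + rng.normal(0.0, 1.0, 29)

best_state, best_score = None, -np.inf
for state in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=state)
    model = LinearRegression().fit(X_tr, y_tr)
    score = model.score(X_te, y_te)          # R^2 on the held-out rows
    if score > best_score:
        best_state, best_score = state, score
print("best random_state:", best_state, "test R^2:", round(best_score, 3))
```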

In [67]:
At random state0, The training accuracy is :-0.9730947151496028
At random state0, The test accuracy is :-0.4803243633239517


At random state1, The training accuracy is :-0.961045232311694
At random state1, The test accuracy is :-0.6891884733447917


At random state2, The training accuracy is :-0.9402440879067483
At random state2, The test accuracy is :-0.2956433349813157


At random state3, The training accuracy is :-0.9636459278075228
At random state3, The test accuracy is :-0.6179134916773945


At random state4, The training accuracy is :-0.9590949693496181
At random state4, The test accuracy is :-0.7969421076715502


At random state5, The training accuracy is :-0.9551231963874857
At random state5, The test accuracy is :-0.536628929785443


At random state6, The training accuracy is :-0.9601627847931741
At random state6, The test accuracy is :-0.7440770072577897


At random state7, The training accuracy is :-0.9464230467928701
At random state7, The test accuracy is :-0.8327718370224058


At random state8, The training accuracy is :-0.9656162432927243
At random state8, The test accuracy is :-0.8112594417584655


At random state9, The training accuracy is :-0.9755702380204887
At random state9, The test accuracy is :-0.7358645148162137


At random state10, The training accuracy is :-0.9790381715722034
At random state10, The test accuracy is :--0.11017031666029364


At random state11, The training accuracy is :-0.9521108583206676
At random state11, The test accuracy is :-0.5685950776521345


At random state12, The training accuracy is :-0.9772717387780184
At random state12, The test accuracy is :-0.6951446779907702


At random state13, The training accuracy is :-0.9700134932226624
At random state13, The test accuracy is :-0.7003301864585941


At random state14, The training accuracy is :-0.9511190089055954
At random state14, The test accuracy is :-0.7888325386674246


At random state15, The training accuracy is :-0.9268973838074166
At random state15, The test accuracy is :-0.9435566074436529


At random state16, The training accuracy is :-0.9791323041199251
At random state16, The test accuracy is :--0.19660975648263213


At random state17, The training accuracy is :-0.9507160383359021
At random state17, The test accuracy is :-0.6796090595337656


At random state18, The training accuracy is :-0.9685528202980584
At random state18, The test accuracy is :-0.26857318911573813


At random state19, The training accuracy is :-0.9603957472379676
At random state19, The test accuracy is :-0.462929211193563


At random state20, The training accuracy is :-0.9594578658676539
At random state20, The test accuracy is :-0.1566096854703133


At random state21, The training accuracy is :-0.9630763886549664
At random state21, The test accuracy is :-0.6029927694932795


At random state22, The training accuracy is :-0.9529562236004397
At random state22, The test accuracy is :-0.651641703106956


At random state23, The training accuracy is :-0.9831902794744897
At random state23, The test accuracy is :-0.637075994768738


At random state24, The training accuracy is :-0.9778465534317279
At random state24, The test accuracy is :--1.2126152648146968


At random state25, The training accuracy is :-0.9595313097041709
At random state25, The test accuracy is :--0.34247026578102613


At random state26, The training accuracy is :-0.9601826919542288
At random state26, The test accuracy is :-0.28747460406299485


At random state27, The training accuracy is :-0.9706600564236557
At random state27, The test accuracy is :-0.7801412244864473


At random state28, The training accuracy is :-0.9712684206077609
At random state28, The test accuracy is :-0.5388060255560909


At random state29, The training accuracy is :-0.9426291621176992
At random state29, The test accuracy is :-0.849123723311783


At random state30, The training accuracy is :-0.951645096527902
At random state30, The test accuracy is :-0.829013442883207


At random state31, The training accuracy is :-0.9796356543277893
At random state31, The test accuracy is :-0.15150480960435775


At random state32, The training accuracy is :-0.9718970733439625
At random state32, The test accuracy is :-0.3408596905657304


At random state33, The training accuracy is :-0.9537296596010983
At random state33, The test accuracy is :-0.7953754817291051


At random state34, The training accuracy is :-0.9706403625510777
At random state34, The test accuracy is :-0.713953461341847


At random state35, The training accuracy is :-0.9680983742439483
At random state35, The test accuracy is :-0.8234589630281659


At random state36, The training accuracy is :-0.9618845315160968
At random state36, The test accuracy is :-0.6077006881709273


At random state37, The training accuracy is :-0.969214511456512
At random state37, The test accuracy is :-0.1397546680268822


At random state38, The training accuracy is :-0.9546076534640046
At random state38, The test accuracy is :-0.547223545027224


At random state39, The training accuracy is :-0.972709612674715
At random state39, The test accuracy is :-0.27938886774745286


At random state40, The training accuracy is :-0.9661566338217432
At random state40, The test accuracy is :-0.4722856783662419


At random state41, The training accuracy is :-0.9527278299451267
At random state41, The test accuracy is :-0.8510923256039526


At random state42, The training accuracy is :-0.9436874759273558
At random state42, The test accuracy is :-0.8445273814519756


At random state43, The training accuracy is :-0.9742875081191817
At random state43, The test accuracy is :-0.1800434006140722


At random state44, The training accuracy is :-0.9583302168189352
At random state44, The test accuracy is :-0.5974000111008917


At random state45, The training accuracy is :-0.9669215772510299
At random state45, The test accuracy is :-0.7166070076683867


At random state46, The training accuracy is :-0.9562778111904172
At random state46, The test accuracy is :-0.6130280269528257


At random state47, The training accuracy is :-0.923359657672783
At random state47, The test accuracy is :-0.961590074727091


At random state48, The training accuracy is :-0.9647918154875401
At random state48, The test accuracy is :-0.5268521454572384


At random state49, The training accuracy is :-0.9607018084759334
At random state49, The test accuracy is :--0.05977783299442185


At random state50, The training accuracy is :-0.9406632163180074
At random state50, The test accuracy is :-0.867390417428149


At random state51, The training accuracy is :-0.9653838190816519
At random state51, The test accuracy is :-0.7521685183828328


At random state52, The training accuracy is :-0.979586392439968
At random state52, The test accuracy is :-0.1077362944254231


At random state53, The training accuracy is :-0.9643149678490127
At random state53, The test accuracy is :-0.6428976721706923


At random state54, The training accuracy is :-0.9593010693145049
At random state54, The test accuracy is :-0.32204077927014885


At random state55, The training accuracy is :-0.9603378830913234
At random state55, The test accuracy is :-0.43479945985936597


At random state56, The training accuracy is :-0.9589669010464636
At random state56, The test accuracy is :-0.8556499530951538


At random state57, The training accuracy is :-0.9592814278552085
At random state57, The test accuracy is :-0.773803471227043


At random state58, The training accuracy is :-0.9757624671359464
At random state58, The test accuracy is :-0.7179536414212728


At random state59, The training accuracy is :-0.960044480615442
At random state59, The test accuracy is :-0.8433826382565536


At random state60, The training accuracy is :-0.9815657651730387
At random state60, The test accuracy is :-0.5148180571177081


At random state61, The training accuracy is :-0.968904655908862
At random state61, The test accuracy is :-0.7262231878767715


At random state62, The training accuracy is :-0.9723860931852902
At random state62, The test accuracy is :-0.7808026450359846


At random state63, The training accuracy is :-0.9815471470695091
At random state63, The test accuracy is :-0.5788110064199209


At random state64, The training accuracy is :-0.9474243041222794
At random state64, The test accuracy is :-0.8315595534902347


At random state65, The training accuracy is :-0.975806179528696
At random state65, The test accuracy is :-0.42929357135735824


At random state66, The training accuracy is :-0.9507459830941206
At random state66, The test accuracy is :-0.8376984449701844


At random state67, The training accuracy is :-0.9476101841154752
At random state67, The test accuracy is :-0.8281153085847592


At random state68, The training accuracy is :-0.9405838649568521
At random state 68, the test accuracy is: 0.8435447220266805

At random state 69, the training accuracy is: 0.9584099116148761
At random state 69, the test accuracy is: 0.6150674389591801

At random state 70, the training accuracy is: 0.9533114604493695
At random state 70, the test accuracy is: 0.7233480599516725

At random state 71, the training accuracy is: 0.9893107431148175
At random state 71, the test accuracy is: -0.5031567843841893

At random state 72, the training accuracy is: 0.9485465960654653
At random state 72, the test accuracy is: 0.48336863777081107

At random state 73, the training accuracy is: 0.9595168132812935
At random state 73, the test accuracy is: 0.7893796812887011

At random state 74, the training accuracy is: 0.961069250160615
At random state 74, the test accuracy is: 0.5649455562414625

At random state 75, the training accuracy is: 0.9298470557575789
At random state 75, the test accuracy is: 0.8398950509728406

At random state 76, the training accuracy is: 0.9621888459076996
At random state 76, the test accuracy is: 0.582816664989181

At random state 77, the training accuracy is: 0.9560887965116833
At random state 77, the test accuracy is: 0.801273148587718

At random state 78, the training accuracy is: 0.9451897225916028
At random state 78, the test accuracy is: 0.819656713009491

At random state 79, the training accuracy is: 0.960893346120576
At random state 79, the test accuracy is: 0.8077184697155407

At random state 80, the training accuracy is: 0.9520707860821686
At random state 80, the test accuracy is: 0.755268032219744

At random state 81, the training accuracy is: 0.9705297884602037
At random state 81, the test accuracy is: 0.7786806074662042

At random state 82, the training accuracy is: 0.9716047276578932
At random state 82, the test accuracy is: 0.5747226123559879

At random state 83, the training accuracy is: 0.9656878192510091
At random state 83, the test accuracy is: 0.46240838389042116

At random state 84, the training accuracy is: 0.9744510459714362
At random state 84, the test accuracy is: -0.02972730437161819

At random state 85, the training accuracy is: 0.9500267080298538
At random state 85, the test accuracy is: 0.8109512443585534

At random state 86, the training accuracy is: 0.9652236474825531
At random state 86, the test accuracy is: -0.5311051690206154

At random state 87, the training accuracy is: 0.9487515183469536
At random state 87, the test accuracy is: 0.7811692995097735

At random state 88, the training accuracy is: 0.9844560213569709
At random state 88, the test accuracy is: -1.3610690144949094

At random state 89, the training accuracy is: 0.9462446737893114
At random state 89, the test accuracy is: 0.8652217096222232

At random state 90, the training accuracy is: 0.9599433686107215
At random state 90, the test accuracy is: 0.5699887453781323

At random state 91, the training accuracy is: 0.9827160783120087
At random state 91, the test accuracy is: 0.488607733762797

At random state 92, the training accuracy is: 0.9840722305072823
At random state 92, the test accuracy is: 0.43876791062007325

At random state 93, the training accuracy is: 0.9665250384834527
At random state 93, the test accuracy is: -1.3805233903539849

At random state 94, the training accuracy is: 0.969205980265268
At random state 94, the test accuracy is: 0.6283764421043705

At random state 95, the training accuracy is: 0.9442943673906647
At random state 95, the test accuracy is: 0.7059079022780512

At random state 96, the training accuracy is: 0.9697139577489582
At random state 96, the test accuracy is: 0.39356552500431086

At random state 97, the training accuracy is: 0.9444165887539644
At random state 97, the test accuracy is: 0.8263021423098558

At random state 98, the training accuracy is: 0.9579091326272794
At random state 98, the test accuracy is: 0.1903272738362316

At random state 99, the training accuracy is: 0.9411713529485165
At random state 99, the test accuracy is: 0.9065946083096769


Since random state = 99 yields the best test accuracy (R² ≈ 0.91), we select it as the random state.
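The sweep above can be sketched as a loop over random states, keeping the split whose test R² is best. Synthetic data stands in here for the real 2014 season features (an assumption for illustration only):

```python
# Sketch of the random-state sweep: split, fit, score at each seed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                       # 30 teams, 5 toy features
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=30)

best_state, best_r2 = None, -np.inf
for state in range(100):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              random_state=state)
    model = LinearRegression().fit(X_tr, y_tr)
    r2 = model.score(X_te, y_te)                   # test R2, printed as "accuracy" above
    if r2 > best_r2:
        best_state, best_r2 = state, r2
```

The chosen seed is then reused for the final split, which is why the best test score recurs in the final model's metrics.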

In [68]:
In [69]:
Out[69]:
(22, 14)
In [70]:
Out[70]:
(22,)
In [71]:
Out[71]:
(7, 14)
In [72]:
Out[72]:
(7,)

MODEL CONSTRUCTION:

Linear Regression

In [73]:
0.9411713529485165
In [74]:
MSE: 12.672633550148328
MAE: 3.2741362790608854
r2_score: 0.8977213310275122

Ridge regression

In [75]:
0.7900161919802784
In [76]:
MSE: 49.239187061063774
MAE: 6.2005454983800705
r2_score: -0.7668782142053812

Support Vector Regression

SVR(kernel='linear')

In [77]:
0.30528092031070986
In [78]:
MSE: 92.33046990756982
MAE: 8.297178483476788
r2_score: -15.403272063076379

SVR(kernel='poly')

In [79]:
0.9623520035446798
In [80]:
MSE: 25.082457868810074
MAE: 4.444342971619469
r2_score: 0.7896849748653584

SVR(kernel='rbf')

In [81]:
0.19209180923383828
In [82]:
MSE: 116.13067806934706
MAE: 9.435955372289124
r2_score: -89.25744991027202

Random Forest Regressor

In [83]:
0.9440521173783166
In [84]:
MSE: 56.61992857142854
MAE: 6.8128571428571405
r2_score: -0.4713584894773417

Decision Tree Regressor

In [85]:
1.0
In [86]:
MSE: 48.857142857142854
MAE: 6.285714285714286
r2_score: 0.24905897114178166

Gradient Boosting Regressor

In [87]:
0.9999999858489446
In [88]:
MSE: 54.10564120650036
MAE: 6.501534794172067
r2_score: -0.07310449126914786

Cross Validation Score

In [89]:
LR  : 27.033589, 21.860738
R  : 35.880015, 28.804724
svr  : 74.591235, 58.448260
svr_p  : 31.647625, 16.838681
svr_r  : 95.650187, 68.864422
RF  : 42.594893, 27.840806
DTR  : 56.250000, 35.508782
GBR  : 46.239162, 31.227040

Based on all the metric scores, we choose LinearRegression as the final model.
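The per-model cross-validation numbers above follow the usual `cross_val_score` pattern; a minimal sketch on toy data (the model list and fold count here are assumptions, not the notebook's exact setup):

```python
# Cross-validated MSE per model; sklearn returns negated MSE for scoring.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=30, n_features=14, noise=10.0, random_state=0)
models = {"LR": LinearRegression(), "R": Ridge(), "svr": SVR(kernel="linear")}

cv_mse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()        # flip the sign back to a positive MSE
```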

HYPERPARAMETER TUNING:

GridSearchCV

In [90]:
In [91]:
In [92]:
In [93]:
Fitting 4 folds for each of 2 candidates, totalling 8 fits
In [94]:
Out[94]:
array([76.45621352, 65.09807686, 95.08772745, 84.24426739, 83.10533837,
       92.65600567, 82.13923134])
In [95]:
MSE: 25.82765799660208
MAE: 3.7618490017363433
r2_score: 0.7007977209358371
In [96]:
Out[96]:
array([76.45621352, 65.09807686, 95.08772745, 84.24426739, 83.10533837,
       92.65600567, 82.13923134])
In [97]:
Out[97]:
<AxesSubplot:xlabel='W', ylabel='Density'>
In [98]:
Out[98]:
[<matplotlib.lines.Line2D at 0x1620d30d4f0>]

We select LinearRegression as the best model after tuning with GridSearchCV.
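The "Fitting 4 folds for each of 2 candidates, totalling 8 fits" message above is consistent with a tiny grid over LinearRegression; the exact grid is an assumption, but the GridSearchCV pattern looks like this:

```python
# A 2-candidate grid over LinearRegression with 4-fold CV (assumed grid).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=22, n_features=14, noise=5.0, random_state=0)
grid = GridSearchCV(
    LinearRegression(),
    param_grid={"fit_intercept": [True, False]},   # 2 candidates
    cv=4,                                          # 4 folds -> 8 fits total
    verbose=1,
)
grid.fit(X, y)
best = grid.best_params_
```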

Model Saving

In [99]:
In [100]:
Out[100]:
['Baseball Case Study_Project.obj']
In [15]:
Out[15]:
Index(['W', 'R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'RA', 'ER',
       'ERA', 'CG', 'SHO', 'SV', 'E'],
      dtype='object')
In [16]:
In [17]:
Out[17]:
<AxesSubplot:>

Correlation analysis (independent features vs the dependent variable): R, HR, 2B, BB, SHO, and SV have a strong positive correlation with the target variable (W). AB, H, 3B, SO, SB, CG, and E have only a weak correlation with the target (some positive, some negative). RA, ER, and ERA have a strong negative correlation with the target and are also strongly correlated with each other; such features can bias the result, so we need to decide whether to drop any of them. AB and H are highly correlated with each other, at about 74%.
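These statements come from a pairwise correlation matrix. A sketch with pandas on the first few sample rows shown earlier (the full analysis uses all 30 rows):

```python
# Pearson correlations between runs scored (R), runs allowed (RA), and wins (W).
import pandas as pd

df = pd.DataFrame({
    "R":  [724, 696, 669, 622, 689],
    "RA": [641, 700, 640, 701, 803],
    "W":  [95, 83, 81, 76, 74],
})
corr = df.corr()
r_vs_w = corr.loc["R", "W"]     # scoring more runs goes with more wins
ra_vs_w = corr.loc["RA", "W"]   # allowing more runs goes with fewer wins
```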

EDA

In [18]:
Out[18]:
Index(['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'RA', 'ER', 'ERA',
       'CG', 'SHO', 'SV', 'E', 'W'],
      dtype='object')
In [19]:
Out[19]:
<AxesSubplot:xlabel='R', ylabel='W'>

#sns.relplot(x="R", y="W", data=df);

There is a positive relationship between runs scored and wins: the more runs a team scores, the higher its chances of winning.

In [20]:
Out[20]:
<AxesSubplot:xlabel='AB', ylabel='W'>

The data is scattered with no clear trend, so AB has only a weak correlation with W and is of little use for predicting wins.

In [21]:
Out[21]:
<AxesSubplot:xlabel='H', ylabel='W'>
In [4]:
In [22]:
Out[22]:
<AxesSubplot:xlabel='2B', ylabel='W'>
In [5]:
In [23]:
Out[23]:
<AxesSubplot:xlabel='3B', ylabel='W'>
In [6]:
In [24]:
Out[24]:
<AxesSubplot:xlabel='HR', ylabel='W'>
In [7]:
In [25]:
Out[25]:
Index(['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'RA', 'ER', 'ERA',
       'CG', 'SHO', 'SV', 'E', 'W'],
      dtype='object')
In [26]:
Out[26]:
<AxesSubplot:xlabel='BB', ylabel='W'>
In [8]:
In [27]:
Out[27]:
<AxesSubplot:xlabel='SO', ylabel='W'>
In [ ]:
In [28]:
Out[28]:
<AxesSubplot:xlabel='SB', ylabel='W'>
In [ ]:
In [29]:
Out[29]:
<AxesSubplot:xlabel='RA', ylabel='W'>
In [ ]:
In [30]:
Out[30]:
<AxesSubplot:xlabel='ER', ylabel='W'>
In [ ]:
In [31]:
Out[31]:
<AxesSubplot:xlabel='ERA', ylabel='W'>
In [ ]:
In [32]:
Out[32]:
<AxesSubplot:xlabel='CG', ylabel='W'>
In [ ]:
In [33]:
Out[33]:
<AxesSubplot:xlabel='SHO', ylabel='W'>
In [ ]:
In [34]:
Out[34]:
<AxesSubplot:xlabel='SV', ylabel='W'>
In [ ]:
In [35]:
Out[35]:
<AxesSubplot:xlabel='E', ylabel='W'>
In [ ]:

Summary: Some features have a weak association with the target variable, while others have a strong inverse relationship.

In addition, there is multicollinearity: some independent features are strongly correlated with each other.

In [36]:
Out[36]:
Index(['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'RA', 'ER', 'ERA',
       'CG', 'SHO', 'SV', 'E', 'W'],
      dtype='object')
In [37]:
Out[37]:
R AB H 2B 3B HR BB SO SB RA ER ERA CG SHO SV E W
0 724 5575 1497 300 42 139 383 973 104 641 601 3.73 2 8 56 88 95
1 696 5467 1349 277 44 156 439 1264 70 700 653 4.07 2 12 45 86 83
2 669 5439 1395 303 29 141 533 1157 86 640 584 3.67 11 10 38 79 81
In [38]:
Out[38]:
<AxesSubplot:xlabel='R', ylabel='W'>
In [39]:
Out[39]:
R AB H 2B 3B HR BB SO SB RA ER ERA CG SHO SV E W
0 724 5575 1497 300 42 139 383 973 104 641 601 3.73 2 8 56 88 95
1 696 5467 1349 277 44 156 439 1264 70 700 653 4.07 2 12 45 86 83
2 669 5439 1395 303 29 141 533 1157 86 640 584 3.67 11 10 38 79 81
3 622 5533 1381 260 27 136 404 1231 68 701 643 3.98 7 9 37 101 76
4 689 5605 1515 289 49 151 455 1259 83 803 746 4.64 7 12 35 86 74
5 891 5509 1480 308 17 232 570 1151 88 670 609 3.80 7 10 34 88 93
6 764 5567 1397 272 19 212 554 1227 63 698 652 4.03 3 4 48 93 87
7 713 5485 1370 246 20 217 418 1331 44 693 646 4.05 0 10 43 77 81
8 644 5485 1383 278 32 167 436 1310 87 642 604 3.74 1 12 60 95 80
9 748 5640 1495 294 33 161 478 1148 71 753 694 4.31 3 10 40 97 78
10 751 5511 1419 279 32 172 503 1233 101 733 680 4.24 5 9 45 119 88
11 729 5459 1363 278 26 230 486 1392 121 618 572 3.57 5 13 39 85 86
12 661 5417 1331 243 21 176 435 1150 52 675 630 3.94 2 12 46 93 85
13 656 5544 1379 262 22 198 478 1336 69 726 677 4.16 6 12 45 94 76
14 694 5600 1405 277 46 146 475 1119 78 729 664 4.14 5 15 28 126 68
15 647 5484 1386 288 39 137 506 1267 69 525 478 2.94 1 15 62 96 100
16 697 5631 1462 292 27 140 461 1322 98 596 532 3.21 0 13 54 122 98
17 689 5491 1341 272 30 171 567 1518 95 608 546 3.36 6 21 48 111 97
18 655 5480 1378 274 34 145 412 1299 84 737 682 4.28 1 7 40 116 68
19 640 5571 1382 257 27 167 496 1255 134 754 700 4.33 2 8 35 90 64
20 683 5527 1351 295 17 177 488 1290 51 613 557 3.43 1 14 50 88 90
21 703 5428 1363 265 13 177 539 1344 57 635 577 3.62 4 13 41 90 83
22 613 5463 1420 236 40 120 375 1150 112 678 638 4.02 0 12 35 77 71
23 573 5420 1361 251 18 100 471 1107 69 760 698 4.41 3 10 44 90 67
24 626 5529 1374 272 37 130 387 1274 88 809 749 4.69 1 7 35 117 63
25 667 5385 1346 263 26 187 563 1258 59 595 553 3.44 6 21 47 75 92
26 696 5565 1486 288 39 136 457 1159 93 627 597 3.72 7 18 41 78 84
27 720 5649 1494 289 48 154 490 1312 132 713 659 4.04 1 12 44 86 79
28 650 5457 1324 260 36 148 426 1327 82 731 655 4.09 1 6 41 92 74
29 737 5572 1479 274 49 186 388 1283 97 844 799 5.04 4 4 36 95 68

Distinct Values

In [40]:
unique values of feature  R =  28
unique values of feature  AB =  29
unique values of feature  H =  29
unique values of feature  2B =  22
unique values of feature  3B =  23
unique values of feature  HR =  27
unique values of feature  BB =  29
unique values of feature  SO =  29
unique values of feature  SB =  27
unique values of feature  RA =  30
unique values of feature  ER =  30
unique values of feature  ERA =  30
unique values of feature  CG =  9
unique values of feature  SHO =  12
unique values of feature  SV =  20
unique values of feature  E =  21
unique values of feature  W =  24
In [41]:
Out[41]:
array([ 2, 11,  7,  3,  0,  1,  5,  6,  4], dtype=int64)
In [42]:
Out[42]:
<AxesSubplot:xlabel='R', ylabel='W'>
In [43]:
Out[43]:
<AxesSubplot:xlabel='CG', ylabel='W'>
In [9]:
In [44]:
Out[44]:
<AxesSubplot:xlabel='ER', ylabel='RA'>
In [10]:

Identifying multicollinearity using VIF

In [45]:
Out[45]:
R AB H 2B 3B HR BB SO SB RA ER ERA CG SHO SV E
0 724 5575 1497 300 42 139 383 973 104 641 601 3.73 2 8 56 88
1 696 5467 1349 277 44 156 439 1264 70 700 653 4.07 2 12 45 86
In [46]:
In [47]:
In [48]:
In [49]:
In [50]:
Out[50]:
features vif
0 R 11.522370
1 AB 13.311532
2 H 10.070668
3 2B 4.019297
4 3B 3.294146
5 HR 10.079902
6 BB 3.806098
7 SO 2.652401
8 SB 2.102684
9 RA 191.839155
10 ER 1680.387145
11 ERA 1222.722240
12 CG 3.059904
13 SHO 3.654331
14 SV 5.798850
15 E 2.186219
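The VIF values above follow the definition VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the other features (statsmodels' `variance_inflation_factor` computes the same quantity). A self-contained sketch:

```python
# VIF from first principles: high VIF flags a feature that the others
# can almost reconstruct, i.e. multicollinearity.
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Return the VIF of each column of a 2-D feature array."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = a + rng.normal(scale=0.1, size=50)   # nearly collinear with a
c = rng.normal(size=50)                  # independent
vifs = vif(np.column_stack([a, b, c]))   # a and b get large VIFs, c stays near 1
```

Dropping one feature of a collinear pair, as done with ER and RA above, is exactly what brings the remaining VIFs back down.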
In [11]:
In [51]:
In [52]:
Out[52]:
R AB H 2B 3B HR BB SO SB RA ERA CG SHO SV E
0 724 5575 1497 300 42 139 383 973 104 641 3.73 2 8 56 88
1 696 5467 1349 277 44 156 439 1264 70 700 4.07 2 12 45 86
In [53]:
In [54]:
In [55]:
Out[55]:
features vif
0 R 11.158733
1 AB 5.863764
2 H 9.628749
3 2B 3.786446
4 3B 3.293109
5 HR 9.057309
6 BB 3.791451
7 SO 2.607389
8 SB 1.849280
9 RA 122.133235
10 ERA 119.328698
11 CG 2.741188
12 SHO 3.410561
13 SV 3.203815
14 E 2.107345
In [ ]:
In [56]:
In [57]:
In [58]:
In [59]:
Out[59]:
features vif
0 R 10.987898
1 AB 4.399954
2 H 8.941874
3 2B 3.729413
4 3B 3.142717
5 HR 7.882841
6 BB 3.468244
7 SO 2.155496
8 SB 1.819823
9 ERA 4.951981
10 CG 2.723370
11 SHO 3.227759
12 SV 2.948133
13 E 2.040676
In [60]:
Out[60]:
R AB H 2B 3B HR BB SO SB ERA CG SHO SV E
0 724 5575 1497 300 42 139 383 973 104 3.73 2 8 56 88
1 696 5467 1349 277 44 156 439 1264 70 4.07 2 12 45 86
2 669 5439 1395 303 29 141 533 1157 86 3.67 11 10 38 79
In [61]:
Out[61]:
<AxesSubplot:>
In [ ]:
In [62]:
Out[62]:
R      1.200786
AB     0.183437
H      0.670254
2B    -0.230650
3B     0.129502
HR     0.516441
BB     0.158498
SO    -0.156065
SB     0.479893
ERA    0.053331
CG     0.736845
SHO    0.565790
SV     0.657524
E      0.890132
dtype: float64

The outliers

In [63]:
In [64]:
Out[64]:
Index(['R', 'AB', 'H', '2B', '3B', 'HR', 'BB', 'SO', 'SB', 'ERA', 'CG', 'SHO',
       'SV', 'E'],
      dtype='object')
In [ ]:
In [65]:
Out[65]:
818.75
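Bounds like the one above typically come from the IQR rule: values beyond Q3 + 1.5·IQR (or below Q1 − 1.5·IQR) are treated as outliers. A sketch on a few 'R' values from the table shown earlier (the real bounds come from the full 30-row frame):

```python
# IQR-based outlier detection on a small sample of the R column.
import numpy as np

r = np.array([724, 696, 669, 622, 689, 891, 764, 713, 644, 748])
q1, q3 = np.percentile(r, [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = r[(r < lower) | (r > upper)]   # only 891 exceeds the upper bound here
```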
In [66]:
In [67]:
In [68]:
In [69]:
In [70]:
In [71]:
In [ ]:
In [72]:
Out[72]:
R      0.284282
AB     0.183437
H      0.670254
2B    -0.230650
3B     0.129502
HR     0.516441
BB     0.158498
SO    -0.156065
SB     0.479893
ERA    0.037969
CG     0.736845
SHO    0.218030
SV     0.612333
E      0.504019
dtype: float64
In [ ]:

The transformation

In [73]:
In [74]:
In [75]:
Out[75]:
R      0.000000
AB     0.000000
H      0.000000
2B    -0.035315
3B    -0.072933
HR    -0.000065
BB    -0.007760
SO     0.041170
SB    -0.010455
ERA    0.001204
CG    -0.059785
SHO   -0.017889
SV     0.001270
E      0.032939
dtype: float64
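Skew values driven close to zero, as above, are what a power transform produces; which transformer the notebook used is an assumption, but the pattern matches sklearn's `PowerTransformer` with Yeo-Johnson:

```python
# Reducing skew with a Yeo-Johnson power transform (illustrative data).
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
df = pd.DataFrame({"E": rng.exponential(scale=10, size=200)})  # right-skewed
before = df["E"].skew()

pt = PowerTransformer(method="yeo-johnson", standardize=True)
df_t = pd.DataFrame(pt.fit_transform(df), columns=df.columns)
after = df_t["E"].skew()          # much closer to zero than before
```

With `standardize=True` the output is also zero-mean/unit-variance, which would explain why the transformed table above already looks standardized.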
In [76]:
Out[76]:
R AB H 2B 3B HR BB SO SB ERA CG SHO SV E
0 0.0 0.0 0.0 1.477685 1.033103 -0.764626 -1.610915 -2.569896 0.939708 -0.502797 -0.359844 -0.827928 1.564693 -0.357505
1 0.0 0.0 0.0 0.084269 1.203320 -0.158581 -0.502749 0.134913 -0.539693 0.261440 -0.359844 0.258244 0.361185 -0.560947
In [ ]:

The standardization

In [77]:

Separating the independent and dependent variables.

In [78]:
In [79]:
In [80]:
Out[80]:
(30, 14)
In [81]:
Out[81]:
(30,)

Applying Machine Learning

In [82]:
In [83]:
At random state 30, the model performs very well
At random state: 30
Test R2 score is: 0.84
Train R2 score is: 0.84
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

At random state 47, the model performs very well
At random state: 47
Test R2 score is: 0.85
Train R2 score is: 0.85
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

At random state 99, the model performs very well
At random state: 99
Test R2 score is: 0.85
Train R2 score is: 0.85
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

At random state 175, the model performs very well
At random state: 175
Test R2 score is: 0.86
Train R2 score is: 0.86
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

In [84]:
Out[84]:
LinearRegression()
In [85]:
In [88]:
In [89]:
mean_absolute_error of  LinearRegression() model 2.7264065961746673
mean_square_error of LinearRegression() model 13.319637960999827
R2 Score of LinearRegression() model 86.25266724208733
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  Ridge() model 2.294706185531917
mean_square_error of Ridge() model 11.152294329870855
R2 Score of Ridge() model 88.48960447605072
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  Lasso() model 2.851798496410254
mean_square_error of Lasso() model 9.180635618736224
R2 Score of Lasso() model 90.524573329286
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  DecisionTreeRegressor() model 8.166666666666666
mean_square_error of DecisionTreeRegressor() model 84.5
R2 Score of DecisionTreeRegressor() model 12.786697247706424
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  SVR() model 8.005345608904014
mean_square_error of SVR() model 91.1232721063577
R2 Score of SVR() model 5.950751266373944
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  KNeighborsRegressor() model 6.233333333333332
mean_square_error of KNeighborsRegressor() model 55.53999999999997
R2 Score of KNeighborsRegressor() model 42.67660550458719
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  RandomForestRegressor() model 4.658333333333334
mean_square_error of RandomForestRegressor() model 31.309850000000015
R2 Score of RandomForestRegressor() model 67.68478784403669
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  XGBRegressor() model 7.472799936930339
mean_square_error of XGBRegressor() model 81.98539532333962
R2 Score of XGBRegressor() model 15.38204611123205
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  ElasticNet() model 2.97480798289115
mean_square_error of ElasticNet() model 13.269638207715806
R2 Score of ElasticNet() model 86.30427249203643
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  SGDRegressor() model 2.5074561752065634
mean_square_error of SGDRegressor() model 13.532296378175312
R2 Score of SGDRegressor() model 86.0331803436264
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  BaggingRegressor() model 4.733333333333337
mean_square_error of BaggingRegressor() model 25.483333333333363
R2 Score of BaggingRegressor() model 73.69839449541283
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  AdaBoostRegressor() model 5.616666666666667
mean_square_error of AdaBoostRegressor() model 37.315000000000005
R2 Score of AdaBoostRegressor() model 61.4868119266055
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


mean_absolute_error of  GradientBoostingRegressor() model 4.980965265127812
mean_square_error of GradientBoostingRegressor() model 46.147551835947866
R2 Score of GradientBoostingRegressor() model 52.3706460408795
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 


Cross Validation

In [90]:
In [91]:
mean_square of  LinearRegression() model 13.319637960999827
cross Validation score of  LinearRegression()  is  -55.10832658669832
**************************************************
mean_square of  Ridge() model 11.152294329870855
cross Validation score of  Ridge()  is  -47.703110230419824
**************************************************
mean_square of  Lasso() model 9.180635618736224
cross Validation score of  Lasso()  is  -38.00120143901657
**************************************************
mean_square of  DecisionTreeRegressor() model 84.5
cross Validation score of  DecisionTreeRegressor()  is  -101.73333333333333
**************************************************
mean_square of  SVR() model 91.1232721063577
cross Validation score of  SVR()  is  -101.43654219683
**************************************************
mean_square of  KNeighborsRegressor() model 55.53999999999997
cross Validation score of  KNeighborsRegressor()  is  -80.90266666666668
**************************************************
mean_square of  RandomForestRegressor() model 31.309850000000015
cross Validation score of  RandomForestRegressor()  is  -71.57627
**************************************************
mean_square of  XGBRegressor() model 81.98539532333962
cross Validation score of  XGBRegressor()  is  -78.86998451989881
**************************************************
mean_square of  ElasticNet() model 13.269638207715806
cross Validation score of  ElasticNet()  is  -47.05931987676341
**************************************************
mean_square of  SGDRegressor() model 13.532296378175312
cross Validation score of  SGDRegressor()  is  -53.97276048904761
**************************************************
mean_square of  BaggingRegressor() model 25.483333333333363
cross Validation score of  BaggingRegressor()  is  -68.26366666666668
**************************************************
mean_square of  AdaBoostRegressor() model 37.315000000000005
cross Validation score of  AdaBoostRegressor()  is  -65.350948643655
**************************************************
mean_square of  GradientBoostingRegressor() model 46.147551835947866
cross Validation score of  GradientBoostingRegressor()  is  -83.87449729411017
**************************************************
In [92]:
Root mean_square of  LinearRegression() model 3.649607918804406
cross Validation score of root mean square  LinearRegression()  is  7.42349827148214
**************************************************
Root mean_square of  Ridge() model 3.3395051025370295
cross Validation score of root mean square  Ridge()  is  6.906743822556315
**************************************************
Root mean_square of  Lasso() model 3.029956372414663
cross Validation score of root mean square  Lasso()  is  6.164511451771062
**************************************************
Root mean_square of  DecisionTreeRegressor() model 9.192388155425117
cross Validation score of root mean square  DecisionTreeRegressor()  is  10.389096848780136
**************************************************
Root mean_square of  SVR() model 9.545851041492199
cross Validation score of root mean square  SVR()  is  10.071570989514496
**************************************************
Root mean_square of  KNeighborsRegressor() model 7.452516353554682
cross Validation score of root mean square  KNeighborsRegressor()  is  8.994590967168362
**************************************************
Root mean_square of  RandomForestRegressor() model 5.59552052985243
cross Validation score of root mean square  RandomForestRegressor()  is  8.006575214576245
**************************************************
Root mean_square of  XGBRegressor() model 9.054578693861997
cross Validation score of root mean square  XGBRegressor()  is  8.880877463398468
**************************************************
Root mean_square of  ElasticNet() model 3.6427514611507337
cross Validation score of root mean square  ElasticNet()  is  6.859979582824092
**************************************************
Root mean_square of  SGDRegressor() model 3.6786269691523916
cross Validation score of root mean square  SGDRegressor()  is  7.31012582955064
**************************************************
Root mean_square of  BaggingRegressor() model 5.048101953539901
cross Validation score of root mean square  BaggingRegressor()  is  8.503077874119072
**************************************************
Root mean_square of  AdaBoostRegressor() model 6.1086004943849455
cross Validation score of root mean square  AdaBoostRegressor()  is  8.066031721155328
**************************************************
Root mean_square of  GradientBoostingRegressor() model 6.7931989398182555
cross Validation score of root mean square  GradientBoostingRegressor()  is  9.172925595679391
**************************************************
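The root-mean-square pairs above are just the square roots of the MSE values: sklearn reports `neg_mean_squared_error`, so the cross-validated RMSE is obtained by negating and taking the square root. A sketch on toy data:

```python
# Converting sklearn's negated-MSE CV scores into a cross-validated RMSE.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=30, n_features=14, noise=10.0, random_state=0)
neg_mse = cross_val_score(LinearRegression(), X, y, cv=5,
                          scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse).mean()       # mean per-fold RMSE
```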

XGBRegressor Hyperparameter Tuning

In [93]:
In [94]:
In [95]:
Out[95]:
GridSearchCV(cv=5, estimator=XGBRegressor(),
             n_jobs=-1,
             param_grid={'colsample_bytree': [0.3, 0.4, 0.5, 0.7],
                         'gamma': [0.01, 0.05, 0.1, 0.2, 0.3],
                         'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
                         'max_depth': [3, 4, 5, 6, 8],
                         'min_child_weight': [1, 3, 5, 7]},
             scoring='neg_mean_squared_error')
In [97]:
Out[97]:
{'colsample_bytree': 0.3,
 'gamma': 0.01,
 'learning_rate': 0.2,
 'max_depth': 4,
 'min_child_weight': 3}
In [98]:
Out[98]:
76.23911029390486
In [12]:

Model saving in pickle format

In [99]:
Out[99]:
['Baseball.pkl']
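The dump above is the standard persistence pattern for sklearn estimators; joblib writes the fitted model to disk and a later session can reload it and predict identically (filename and data here are illustrative):

```python
# Save a fitted model to disk, reload it, and check predictions match.
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.arange(20, dtype=float).reshape(10, 2)
y = X.sum(axis=1)
model = LinearRegression().fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "Baseball.pkl")
joblib.dump(model, path)                 # -> ['.../Baseball.pkl']
reloaded = joblib.load(path)
same = np.allclose(model.predict(X), reloaded.predict(X))
```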
In [ ]: